32. Exploiting Data Level Parallelism

第32章利用数据级并行性
AP SHANTHI博士

本模块的目标是讨论如何在处理器中利用数据级并行性。我们将讨论矢量架构、SIMD 指令和图形处理单元 (GPU) 架构。

我们在前面的模块中讨论了利用指令级并行性和线程级并行性的不同技术。我们现在将讨论利用数据级并行性的不同类型的体系结构，即 SIMD 风格的体系结构。SIMD 架构可以利用重要的数据级并行性，不仅用于面向矩阵的科学计算，还用于面向媒体的图像和声音处理，这在当今非常流行。此外，SIMD 比 MIMD 更节能，因为我们每次数据操作只需要获取一条指令。这也使 SIMD 对个人移动设备具有吸引力。图 36.1 显示了随着时间的推移，x86 计算机通过 MIMD、SIMD 以及 MIMD 和 SIMD 并行化的潜在加速。该数字假设 MIMD 的每个芯片将每两年增加两个内核，并且 SIMD 类型计算的操作数量将每四年翻一番。SIMD 的潜在加速预计是 MIMD 的两倍。因此，我们有必要讨论 SIMD 架构的细节。

SIMD 基本上有三种变体。他们是：

• 矢量架构

• SIMD 扩展

    • 图形处理器单元(GPU) 我们将详细讨论其中的每一个。

矢量架构：这是最古老的 SIMD 架构风格，广泛用于当时的超级计算机。由于所需的晶体管数量和内存带宽，它们被认为过于昂贵而无法在微处理器中实现。然而，现在已经不是这样了。矢量体系结构基本上对数据矢量进行操作。它们将分散在多个内存位置的数据收集到一个大型向量寄存器中，独立操作数据元素，然后将数据存储回各自的内存位置。这些大型寄存器文件由编译器控制，用于隐藏内存延迟并利用内存带宽。

作为一个示例架构，为了理解向量处理器的概念，我们将看一下VMIPS架构，其标量版本是我们已经熟悉的MIPS架构。VMIPS ISA 的主要组件如图 36.2 所示。它们如下：

1. 向量寄存器：这些寄存器中的每一个都保存一个向量。在 VMIPS 架构的情况下，有八个寄存器，每个寄存器保存一个 64 元素、64 位/元素向量。为了给功能单元提供数据并提供足够的重叠，寄存器文件有16个读端口和8个写端口，通过交叉开关连接到功能单元的输入或输出。

2. 矢量功能单元：这些功能单元是完全流水线化的，能够在每个时钟周期开始一个操作。控制单元检测结构 危险和数据危险。图中标出了五个功能单元。

3. 矢量加载存储单元：该单元处理要从内存加载和加载到内存的矢量数据。它们是完全流水线化的单元，在初始延迟之后每个时钟周期具有一个字的带宽。该单元还处理标量加载和存储。

4. 标量寄存器：这些寄存器可以将输入传递给向量功能单元，还可以计算向量加载/存储操作的地址。有32个通用寄存器和32个浮点寄存器。当标量值从标量寄存器文件中读出时，矢量功能单元的一个输入锁存标量值。

5、VMIPS的ISA除了MIPS的标量指令外，还由向量指令组成。例如，ADDVV.D 指令添加两个向量，ADDVS.D 向标量添加一个向量，LV/SV 从 MIPS 通用寄存器中指示的起始地址执行向量加载和向量存储。除此之外，还有两个通用寄存器，向量长度不同时使用的向量长度寄存器和条件执行时使用的向量掩码寄存器。

为了更清楚地了解向量运算，请考虑以下代码序列来执行运算 Y = a * X + Y。

DAXPY（双精度 a*X+Y）

LD F0，一个；加载标量 a

LV V1，接收；载荷矢量 X

MULVS.D V2、V1、F0；矢量标量乘法

LV V3，Ry；载荷矢量 Y

ADDVV V4、V2、V3；添加

SV Ry，V4；存储结果

相应的 MIPS 代码如下所示。

LD F0, a ; 加载标量 a

DADDIUR4，接收，#512；最后加载地址

循环：LD F2, 0(Rx); 加载 X[i]

MUL.D F2, F2, F0; 斧 X[i]

LD F4,0(Ry); 加载 Y[i]

ADD.D F4, F2, F2; 轴 X[i] + Y[i]

SD F4, 9(Ry); 存储到 Y[i]

DADDIU Rx, Rx, #8; 将索引增加到 X

DADDIU Ry，Ry，#8；将索引增加到 Y

SUBBU R20、R4、Rx；计算边界

BNEZ R20，循环；检查是否完成

It can be easily seen that VMIPS requires only 6 instructions, whereas, MIPS takes about 600 instructions. This is because of the vector operations done (which would have otherwise been done in an iterative manner) and the removal of the loop overhead instructions. Another important difference is the frequency of interlocks. In the case of the MIPS code, the dependences will cause stalls, whereas, in the case of VMIPS, each vector instruction will stall only for the first vector element. Thus, pipeline stalls will occur only once per vector instruction, rather than vector element. Although loop unrolling and software pipelining can help reduce the stalls in MIPS, there will not be a significant reduction.

Vector Execution Time: The vector execution time depends on three factors

–  Length of operand vectors

–  Structural hazards

–  Data dependences

Given the length of the vector and the initiation rate, which is the rate at which a vector unit consumes new operands and produces new results, we can calculate the time taken to execute a vector instruction. With just one lane and the initiation rate of one element per clock cycle, the execution time for one vector instruction is approximately the vector length. We normally talk about convoys, which is the set of vector instructions that could execute together. We define chime as the unit of time to execute one convoy. So, m convoys execute in m chimes, and if the vector length is n, it takes m x n clock cycles to execute. Look at the following code sequence:

LV V1, Rx; load vector X

MULVS.D V2,V1,F0 ;vector-scalar

LV V3,Ry ; load vector Y

ADDVV.D V4,V2,V3 ; add two vectors

SV Ry, V4 ; 存储总和

有 3 个车队，所以有 3 个 chime，每个结果有 2 个浮点运算。因此，每个 FLOP 的周期数 = 1.5。对于 64 个元素向量，它需要 64 x 3 = 192 个时钟周期。当然，我们假设具有先读后写依赖风险的序列可以通过链接在同一个车队中。一旦矢量源操作数的各个元素可用，链接一个矢量操作就可以开始。例如，当第一个向量元素从内存中加载时，它被链接到乘法器单元。类似地，第二个加载和乘法的输出链接到加法器单元。这里讨论的 chime 模型忽略的最重要的开销是向量启动时间，这取决于向量的延迟 功能单元。

提高矢量处理器的性能：在讨论了矢量架构的基础知识之后，我们现在将看看矢量架构通常采用的方法来提高性能。通常会进行以下优化。

多通道：此优化着眼于具有并行通道，以便每个时钟周期生成多个元素。图 36.3 显示了如何使用多个通道。左边的处理器只有一个通道（一个加法流水线）并且每个时钟周期只产生一个加法。右边的有四个通道，每个时钟周期产生四个加法。向量的元素在多个通道中交错。

图 36.4

Figure 36.4 shows how the four lanes are formed. The vector register storage is divided among the four lanes, with each lane holding every fourth element of the vector register. The adder unit, the multiplier unit and the load/store unit work together on four parallel lanes. The arithmetic units have four execution pipelines. Adding multiple lanes is a popular way of increasing the performance of vector processors as it does not increase the control and design complexity too much.

Vector Length Registers to handle non-64 wide vectors: The natural vector length depends on the length of the vector registers. In our case, it is 64. But, this may not be true always and the length may not even be known at compile time. Therefore, processors look at the usage of Vector Length Register (VLR). The length of the vector can be loaded into this register. This will work as long as the real length is less than the maximum vector length, which the vector register length. In cases where the real leng th is greater than the maximum length, the technique of strip mining is used. Here, several iterations that are multiples of the maximum length are done and the remaining iterations are done separately. Figure 36.5 illustrates the idea.

Vector Mask Registers to handle IF statements in vector code: Sometimes, we may not want to execute the operation on all the elements of the vector. Depending on some condition, we may want to operate on certain elements alone. This is done with the help of a vector mask register, which has one bit per element. Depending on whether the masking bit is set or not for each element, the operation is carried out for that element.

Memory Banks – Memory system optimizations to support vector processors : As the memory bandwidth demands are very high for vector processors, memory interleaving and independent memory banks are used. Support for controlling the bank addresses independently, Loading or storing non sequential words and multiple vector processors sharing the same memory are provided.

Stride: Handling Multiple Dimensional Matrices in Vector Processors: Strided addressing skips a fixed number of words between each access, so that data is accessed with non-unit strides and stored in a vector register. Sequential addressing is often called unit stride addressing. Once the accessing is done with a different stride length and stored in the register, the operations on these elements are carried out in the usual manner.

Scatter-Gather: To Handle Sparse matrices: Many a times we need to handle sparse data. Gather and scatter operations help collecting the data and then storing them back using index vectors. A gather operation takes an index vector and fetches the vector whose elements are at the addresses given by adding a base address to the offsets given in the index vector. After performing the operations on these elements in a dense form, the sparse vector can be stored in an expanded form using the same index vector.

编程向量体系结构：影响性能的程序结构：这种体系结构的一个优点是它提供了一个简单的执行模型，编译器可以告诉程序员哪些部分已经被向量化，哪些部分没有被向量化。编译器可以向程序员提供反馈并且程序员可以向编译器提供提示。这种透明度使改进成为可能。当然，很大程度上取决于程序结构本身——依赖性、选择的算法以及它们的编码方式。

SIMD 指令集扩展： SIMD 多媒体扩展始于简单的观察，即许多媒体应用程序对宽度比原生字长更窄的数据类型进行操作。例如，一个加法器可以划分为更小的加法器。但是，与非常优雅的向量指令不同，与向量指令相比，SIMD 扩展具有以下限制：

1.数据操作数的数量被编码到操作码本身，因此指令数量急剧增加。没有使用向量长度寄存器。

2. 不支持复杂的寻址模式，例如跨步访问和分散-聚集访问。

3. 不支持条件执行，因为不支持屏蔽寄存器。

4、这些都使得编译器更难生成SIMD代码，增加了SIMD汇编语言编程的难度

以下是已实现的 SIMD 扩展：

英特尔 MMX (1996)
八个 8 位整数操作或四个 16 位整数操作
流式 SIMD 扩展 (SSE) (1999)
八个 16 位整数操作
四个 32 位整数/fp 操作或两个 64 位整数/fp 操作
高级矢量扩展 (2010)
四个 64 位整数/fp 操作
操作数必须是连续且对齐的内存位置
通常旨在加速精心编写的库而不是编译器
 
尽管存在缺点，但 SIMD 扩展与矢量架构相比具有一定的优势：

加入标准ALU成本低，易于实现
它几乎不需要额外的状态，这意味着它也很容易进行上下文切换 o 它几乎不需要额外的内存带宽
不存在跨页访问和page-fault的虚拟内存问题
图 36.6 显示了使用 SIMD 扩展指令的 DXPY 循环代码。

图形处理单元 (GPU)：利用数据级并行性的第三种架构风格是 GPU。GPU 具有高度并行的操作，在媒体应用程序中变得非常流行。这与可用于对 GPU 进行编程的编程语言的可用性相结合，开辟了使用 GPU 进行通用计算的选项。

GPUs support a heterogeneous execution model, with the CPU as the host and the GPU as the device. The challenge lies in partitioning the computations between the CPU and the GPU and the usage of the system memory/device memory. GPUs have every type of parallelism that can be captured by the programming environment – MIMD, SIMD, multithreading and ILP.

NVIDIA has developed a C like language to exploit all sorts of parallelism and make the programming of GPUs easy through their Compute Unified Device Architecture (CUDA). OpenCL is a vendor independent programming language for multiple platforms. All this has made the programming of GPUs easier and the increased popularity of GP-GPU Computing.

NVIDIA has unified all forms of GPU parallelism as a CUDA thread. The compiler and the hardware can together combine thousands of such threads to exploit all forms of parallelism. The programming model is “Single Instruction Multiple Thread”. A thread is associated with each data element. Threads are organized into blocks. We can have Thread Blocks of up to 512 elements. A Multithreaded SIMD Processor is the hardware that executes a whole thread block (32 elements executed per thread at a time) . The thread blocks are organized into a grid and blocks are executed independently and in any order. Different blocks cannot communicate directly but can coordinate using atomic memory operations in Global Memory. The GPU hardware itself handles the thread management, rather than the applications or OS. A multiprocessor is composed of multithreaded SIMD processors. A Thread Block Scheduler is responsible for scheduling the threads.

A GPU architecture is similar to vector architectures in that they work well with data-level parallel problems, have support for scatter-gather transfers, have support for mask registers and have large register files. The differences are that they do not have a scalar processor, use multithreading to hide memory latency and have many functional units, as opposed to a few deeply pipelined units .

例如，让我们考虑两个长度为 8192 的向量的乘法。处理所有 8192 个元素的代码是网格。线程块将其分解为可管理的大小，例如 512 个元素/块。因此，我们有 16 个 SIMD 线程/块，32 个元素/线程。SIMD 指令一次执行 32 个元素。因此，网格大小为 16 个块，其中一个块类似于向量长度为​​ 32 的带状挖掘向量循环。一个块由线程块调度程序分配给多线程 SIMD 处理器。线程调度器使用记分板进行调度，线程之间不应存在任何数据依赖关系。多线程 SIMD 处理器的数量因不同的 GPU 而异。像 Fermi 这样的当代 GPU 有 7-15 个多线程 SIMD 处理器。这种映射如图 36.7 所示。在每个 SIMD 处理器中，有 32 个 SIMD 通道，与矢量处理器相比，它们既宽又浅。从程序员的角度来看，内核是可以并行执行的指令序列，线程是执行单个内核的基本执行单元，线程块是执行同一内核的线程的集合， 网格是执行相同内核的线程块的集合。一个 CUDA 程序可以启动多个网格，每个所需的并行任务一个。

GPU Memory Structures: There are different memories that are supported in a GPU architecture. Each SIMD lane has a private section of off-chip DRAM, called the private memory, which is used to contain stack frames, spilling registers, and private variables. Each multithreaded SIMD processor also has a local memory, which is shared by SIMD lanes / threads within a block. The memory shared by all the SIMD processors is the GPU memory and the host can read and write GPU memory. Figure 36.8 shows the memory hierarchy of the CUDA threads.

Fermi Architecture Case Study: Figure 36.9 shows the floor plan of the Fermi GTX 480 GPU. There are 16 multithreaded SIMD Processors. The Thread Block Scheduler is highlighted on the left. The GTX 480 has 6 GDDR5 ports, each 64 bits wide, supporting up to 6 GB of capacity. The Host Interface is PCI Express 2.0 x 16. Giga Thread is the name of the scheduler that distributes thread blocks to Multiprocessors, each of which has its own SIMD Thread Scheduler.

NVIDIA 工程师分析了数百个着色器程序，这些程序显示越来越多地使用标量计算，并意识到很难有效利用矢量架构的所有处理单元。他们还估计，基于标量设计的架构效率，使用 128 个标量处理器的标量架构与使用 32 个四分量向量处理器的标量架构相比，可以实现多达 2 倍的性能提升。所以从 G80 开始，NVIDIA 转向 基于标量处理器的设计。

CUDA 内核，如图 36.10 所示，是一个简单的标量处理器，它具有一个完全流水线化的算术逻辑单元 (ALU) 和一个浮点单元 (FPU)。它每个时钟执行一条浮点或整数指令。它有一个非常简单的指令流水线。Fermi 架构由 512 个 CUDA 内核组成。流式多处理器 (SM) 是共享相同 L1 缓存的 32 个 CUDA 内核的集合。32 个 CUDA 内核进一步分为两组，每组 16 个内核。有 16 个加载/存储单元可以访问来自缓存/DRAM 的数据。4 个特殊功能单元 (SFU) 执行复杂指令，如正弦、余弦、倒数、平方根等。

Fermi 支持两级调度器。在芯片级，全局调度器将线程块调度到各个 SM，而在 SM 级，warp 调度器将 32 个线程的 warp 分配到其执行单元。Fermi 支持并发内核执行，同一应用上下文的不同内核可以同时在 GPU 上执行。并发内核执行允许执行多个小内核的程序利用整个 GPU。SM 以 16 个并行线程为一组调度线程，称为扭曲。每个 SM 具有两个扭曲调度器和两个指令调度单元，允许同时发布和执行两个扭曲。Fermi 的双扭曲调度器选择两个扭曲，并从每个扭曲向一组十六个内核、十六个加载/存储单元或四个 SFU发出一条指令 。

The per-SM L1 cache is configurable to support both shared memory and caching of local and global memory operations. The 64 KB memory can be configured as either 48 KB of Shared memory with 16 KB of L1 cache, or 16 KB of Shared memory with 48 KB of L1 cache. When configured with 48 KB of shared memory, programs that make extensive use of shared memory (such as electro-dynamic simulations) can perform up to three times faster. For programs whose memory accesses are not known beforehand, the 48 KB L1 cache configuration offers greatly improved performance over direct access to DRAM. The innovations in the Fermi architecture include fast double precision, caches for GPU memory, 64-bit addressing and unified address space, support for error correcting codes, faster context switching and faster atomic instructions.

CUDA 编程模型，涉及线程、线程块和网格，直接映射到 Fermi 硬件架构，涉及 CUDA 内核、SM 和整个 GPU。如图 36.11 所示。

NVIDIA GPU ISA： NVIDIA 编译器的指令集目标是硬件指令集的抽象。并行线程执行 (P TX) 是一种低级虚拟机和 ISA，旨在支持并行线程处理器的操作。它为编译器提供了稳定的指令集，并且兼容多代 GPU 架构。硬件指令集对程序员是隐藏的。PTX 指令描述了单个 CUDA 线程上的操作，通常映射到硬件指令。此外，一个 PTX 可以映射到多个指令，反之亦然。在程序安装时，GPU 驱动程序将 PTX 指令转换为机器指令。

概括：

本模块的要点总结如下：

矢量体系结构、SIMD 风格的体系结构或 SIMD 扩展和图形处理单元利用了应用程序中存在的数据级并行性。
矢量架构支持矢量寄存器、矢量指令、深度流水线功能单元和流水线内存访问。
矢量处理器支持条件执行、不同长度的矢量、分散-聚集操作、链接等特殊功能。
处理器中的 SIMD 扩展利用 DLP 对硬件进行最小更改。
它们对于流应用程序非常有用。
GPU 尝试利用所有类型的并行性并形成异构架构。
有多个具有多线程执行的 SIMD 通道。
内存层次结构支持数据的高效访问。
支持 PTX 低级虚拟机执行。
Fermi 架构是作为 GPU 架构的一个案例研究完成的。
 
网页链接/支持材料

Computer Architecture – A Quantitative Approach，John L. Hennessy 和 David A. Patterson，第 5 版，Morgan Kaufmann，Elsevier，2011。
NVIDIA 白皮书：
http://www.nvidia.in/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf